As a machine learning (ML) newbie, I needed to figure out the best way to prepare the dataset for training our model. As mentioned in my last article, I came up with a Python package for this process!
Whenever you train a custom model, the most important thing is the dataset. The dataset plays the main role in deep learning, and the accuracy of your model depends on it. So before you train a custom model, you need to plan how to build the dataset. Here, I’m going to share my ideas on an easy way to build yours.
MLDatasetBuilder-Version 1.0.0
A Python package to build Dataset for Machine Learning
Whenever we begin a machine learning project, the first thing we need is a dataset. The dataset is the pillar of the training model. You can build it either automatically or manually. MLDatasetBuilder is a Python package that helps prepare the images for your ML dataset.
GitHub repo: karthick965938/ML-Dataset-Builder
Installation
We can install the MLDatasetBuilder package with the command below:
pip install MLDatasetBuilder
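If you want to confirm the install from a script rather than by eye, here is a minimal sketch using only the standard library. The helper name is_installed is my own (it is not part of MLDatasetBuilder), and I am assuming the import name matches the package name, as the import statement later in this article suggests.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when the module is not importable.
    return importlib.util.find_spec(module_name) is not None


# Assumption: the import name is the same as the pip package name.
if is_installed("MLDatasetBuilder"):
    print("MLDatasetBuilder is available")
else:
    print("MLDatasetBuilder is missing, try: pip install MLDatasetBuilder")
```

This only checks that Python can locate the module; it does not import it or run any of its code.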
How to test?
When you run python3 in the terminal, it will produce output like this:
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Run the following code to see the initialization output of the MLDatasetBuilder package.
>>> from MLDatasetBuilder import *
>>> MLDatasetBuilder()
Available Operations
PrepareImage — Remove unwanted format images and Rename your images
#PrepareImage(folder_name, image_name)
PrepareImage('images', 'dog')
ExtractImages — Extract images from video file
#ExtractImages(video_path, file_name, frame_size)
ExtractImages('video.mp4', 'frame', 10)
#OR
#ExtractImages(video_path, filename)
ExtractImages('video.mp4', 'frame')
#Default FPS will be 5
Step1 — Get images from google
Yes, we can get images from Google. With the Download All Images browser extension, we can easily collect images in a few minutes. Check the extension’s page for more details!
Step2 — Create a Python file
Once you have downloaded the images with this extension, create a Python file called test.py in the same directory as the image folder, as shown below.
download_image_folder/
| _14e839ba-9691-11ea-a968-2ed746e9a968.jpg
| 5e5f7af12600004018b602c0.jpeg
| A471529_Alice_b-1.jpg
| image1.png
| image2.png
| ...
test.py
Inside the image folder, you can see lots of images in mixed formats (JPG, JPEG, PNG) with random filenames.
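Before cleaning anything up, it can help to see what the download actually contains. The sketch below is not part of MLDatasetBuilder; count_extensions is a hypothetical helper that uses only the standard library to count files per extension in a folder.

```python
from collections import Counter
from pathlib import Path


def count_extensions(folder):
    """Count the files in `folder` by lowercase extension (subfolders are skipped)."""
    counts = Counter()
    for path in Path(folder).iterdir():
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


# Example usage (assumes the folder from this article exists):
# print(count_extensions("download_image_folder"))
```

A quick look at these counts shows whether the folder holds formats you did not expect before you run the cleanup step.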
Step3 — PrepareImage
MLDatasetBuilder provides a method called PrepareImage. With this method we can remove the unwanted images and rename the image files that you have already downloaded with the browser extension.
#PrepareImage(folder_path, class_name)
PrepareImage('download_image_folder', 'dog')
As the code above shows, we need to pass the image folder path and the class name.
After the process completes, your image folder structure will look like this:
download_image_folder/
| dog_0.jpg
| dog_1.jpg
| dog_2.jpg
| dog_3.png
| dog_4.png
| ...
test.py
This step makes it much easier to annotate your images when labelling, and consistent file names are a good standard to keep across the dataset.
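To make the idea concrete, here is a rough sketch of a filter-and-rename pass written with the standard library only. It is not the package’s actual implementation: the function name prepare_images_sketch, the set of accepted extensions, and the choice to keep each file’s original extension are all my assumptions for illustration. It deletes files, so try it on a copy of your folder.

```python
import uuid
from pathlib import Path

# Assumption: these are the "wanted" formats; the package may use a different list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def prepare_images_sketch(folder, class_name):
    """Delete non-image files, then rename the rest to <class_name>_<n><ext>."""
    folder = Path(folder)
    kept = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue  # leave subfolders alone
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            kept.append(path)
        else:
            path.unlink()  # remove files in unwanted formats

    # Rename in two passes so a new name never overwrites a file
    # that is still waiting to be renamed (e.g. an existing dog_0.jpg).
    token = uuid.uuid4().hex
    staged = []
    for index, path in enumerate(kept):
        temp = folder / f"{token}_{index}{path.suffix.lower()}"
        path.rename(temp)
        staged.append(temp)

    new_names = []
    for index, temp in enumerate(staged):
        final = folder / f"{class_name}_{index}{temp.suffix}"
        temp.rename(final)
        new_names.append(final.name)
    return new_names


# Example usage (hypothetical):
# prepare_images_sketch("download_image_folder", "dog")
```

Files are numbered in sorted filename order, so running the sketch twice on the same folder gives the same names.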
Step4 — ExtractImages
MLDatasetBuilder also provides a method called ExtractImages. Using this method we can extract the images from the video files.
download_image_folder/
video.mp4
test.py
As the code below shows, we need to pass the video path, the folder name, and the framesize. The folder name is used as the class name. The framesize is optional and defaults to 5.
#ExtractImages(video_path, folder_name, framesize)
ExtractImages('video.mp4', 'frame', 10)
#OR
#ExtractImages(video_path, folder_name)
ExtractImages('video.mp4', 'frame')
After the process completes, your image folder structure will look like this:
download_image_folder/
dog/
| dog_0.jpg
| dog_1.jpg
| dog_2.jpg
| dog_3.png
| dog_4.png
| ...
dog.mp4
test.py
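For readers curious about how frames can be pulled out of a video, here is a minimal sketch using OpenCV (pip install opencv-python). This is not how MLDatasetBuilder is necessarily implemented, and I am not sure exactly what framesize means inside the package. In this sketch the hypothetical parameter every_n simply means "save every n-th frame", and the function names are my own.

```python
from pathlib import Path


def should_keep(index, every_n):
    """Return True for frame indices 0, every_n, 2 * every_n, and so on."""
    return index % every_n == 0


def extract_frames_sketch(video_path, folder_name, every_n=5):
    """Save every `every_n`-th frame of a video as <folder>/<folder>_<n>.jpg."""
    import cv2  # imported here so the helper above works without OpenCV

    if every_n < 1:
        raise ValueError("every_n must be at least 1")

    out_dir = Path(folder_name)
    out_dir.mkdir(parents=True, exist_ok=True)

    capture = cv2.VideoCapture(str(video_path))
    if not capture.isOpened():
        raise IOError(f"Could not open video: {video_path}")

    saved = 0
    index = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break  # end of video or read error
            if should_keep(index, every_n):
                target = out_dir / f"{out_dir.name}_{saved}.jpg"
                cv2.imwrite(str(target), frame)
                saved += 1
            index += 1
    finally:
        capture.release()
    return saved


# Example usage (hypothetical file names):
# extract_frames_sketch("video.mp4", "frame", every_n=10)
```

Skipping frames keeps the dataset smaller and avoids saving many near-identical images from consecutive frames.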
What is version 2.0.0?
I plan to release version 2.0.0 next month. It will include some additional features.
Specifically, the package will provide images of more than 100 objects, along with annotation files 🙂
Contributing
All issues and pull requests are welcome! To run the code locally, first, fork the repository and then run the following commands on your computer:
git clone https://github.com/<your-username>/ML-Dataset-Builder.git
cd ML-Dataset-Builder
# Creating a virtual environment before the next step is recommended
pip3 install -r requirements.txt
When adding code, be sure to write unit tests where necessary.
Contact
MLDatasetBuilder was created by Karthick Nagarajan. Feel free to reach out on Twitter, LinkedIn, or by email!